Skip to content

Conversation

@davies
Copy link
Contributor

@davies davies commented Jan 27, 2016

As benchmarked and discussed here: https://github.com/apache/spark/pull/10786/files#r50038294, benefits from codegen, the declarative aggregate function could be much faster than imperative one.

@davies
Copy link
Contributor Author

davies commented Jan 27, 2016

cc @mengxr

@SparkQA
Copy link

SparkQA commented Jan 27, 2016

Test build #50240 has started for PR 10960 at commit 61edd5e.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a Cast() here is very expensive

@SparkQA
Copy link

SparkQA commented Jan 28, 2016

Test build #50265 has finished for PR 10960 at commit 3c8d737.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those if branches are important to save computation for low-order statistics. Even we won't use CentralMomentAgg for second-order statistics, it is still good to keep them.

@mengxr
Copy link
Contributor

mengxr commented Jan 28, 2016

@davies Did you get a chance to test whole-stage codegen with higher-order statistics like skewness? If it works, the cleanest solution would be changing CentralMomentAgg to declarative and then make all existing univariate summary statistics call it.

@SparkQA
Copy link

SparkQA commented Jan 28, 2016

Test build #50294 has finished for PR 10960 at commit ae83955.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class CentralMomentAgg(child: Expression) extends DeclarativeAggregate
    • case class Kurtosis(child: Expression) extends CentralMomentAgg(child)
    • case class Skewness(child: Expression) extends CentralMomentAgg(child)
    • case class Echo(child: Expression) extends UnaryExpression

@SparkQA
Copy link

SparkQA commented Jan 28, 2016

Test build #50297 has finished for PR 10960 at commit 1481bb4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies davies changed the title [SPARK-12963] Improve performance of stddev/variance [SPARK-12963] Improve performance of stat functions Jan 28, 2016
Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala
@SparkQA
Copy link

SparkQA commented Jan 28, 2016

Test build #50304 has finished for PR 10960 at commit 1b95b7c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Jan 29, 2016

@davies side note: The JIRA number is wrong.

@davies davies changed the title [SPARK-12963] Improve performance of stat functions [SPARK-12913] Improve performance of stat functions Jan 29, 2016
@SparkQA
Copy link

SparkQA commented Jan 29, 2016

Test build #50322 has finished for PR 10960 at commit ae78e81.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies davies changed the title [SPARK-12913] Improve performance of stat functions [SPARK-12913] [SQL] Improve performance of stat functions Jan 29, 2016
Davies Liu added 2 commits January 29, 2016 08:06
Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala
@SparkQA
Copy link

SparkQA commented Jan 29, 2016

Test build #50384 has finished for PR 10960 at commit 1086810.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SetDatabaseCommand(databaseName: String) extends RunnableCommand

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new implementation of Corr/Covar have better accuracy, so updated the tests to match that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to see we can tolerate some small numerical differences in query tests. But this is out of scope here.

@SparkQA
Copy link

SparkQA commented Jan 29, 2016

Test build #50386 has finished for PR 10960 at commit 383c193.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Corr(x: Expression, y: Expression) extends DeclarativeAggregate
    • abstract class Covariance(x: Expression, y: Expression) extends DeclarativeAggregate

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto on EqualTo => ===

@mengxr
Copy link
Contributor

mengxr commented Feb 2, 2016

@davies I made one pass. It would be nice to have a JIRA for checking query result with tolerance on numerical differences, because the result might change (though unlikely) if we merge the partial results in a different order.

@davies
Copy link
Contributor Author

davies commented Feb 2, 2016

@mengxr Thanks for reviewing this, I should had addressed all your comments.

@SparkQA
Copy link

SparkQA commented Feb 2, 2016

Test build #50560 has finished for PR 10960 at commit 9b74195.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class PrintToStderr(child: Expression) extends UnaryExpression

@mengxr
Copy link
Contributor

mengxr commented Feb 2, 2016

test this please

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for some miscommunication. The previous inline comments are useful here because Lit(0.0) carries no information. The comments are not necessary when the variable names can clearly tell what they are. Please recover the inline comments for initial values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no difference for these initial values, the order does not matter here. Do you still think we should keep those comments? Or should I change to use fill()?

@SparkQA
Copy link

SparkQA commented Feb 2, 2016

Test build #50565 has finished for PR 10960 at commit 9b74195.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class PrintToStderr(child: Expression) extends UnaryExpression

@mengxr
Copy link
Contributor

mengxr commented Feb 2, 2016

LGTM pending Jenkins. It is great to see 5x speedup!

@SparkQA
Copy link

SparkQA commented Feb 2, 2016

Test build #50569 has finished for PR 10960 at commit 5f98588.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class PrintToStderr(child: Expression) extends UnaryExpression

@davies
Copy link
Contributor Author

davies commented Feb 2, 2016

Merging this into master, thanks!

@asfgit asfgit closed this in be5dd88 Feb 2, 2016
@SparkQA
Copy link

SparkQA commented Feb 2, 2016

Test build #50572 has finished for PR 10960 at commit fe6fe50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants